Abstract

For the task of recognizing dialogue acts, we are 
applying the Transformation-Based Learning 
(TBL) machine learning algorithm. To circum- 
vent a sparse data problem, we extract values 
of well-motivated features of utterances, such 
as speaker direction, punctuation marks, and a 
new feature, called dialogue act cues, which we 
find to be more effective than cue phrases and 
word n-grams in practice. We present strate- 
gies for constructing a set of dialogue act cues 
automatically by minimizing the entropy of the 
distribution of dialogue acts in a training cor- 
pus, filtering out irrelevant dialogue act cues, 
and clustering semantically-related words. In 
addition, to address limitations of TBL, we in- 
troduce a Monte Carlo strategy for training ef- 
ficiently and a committee method for comput- 
ing confidence measures. These ideas are com- 
bined in our working implementation, which la- 
bels held-out data as accurately as any other 
reported system for the dialogue act tagging
task. 

Introduction 

Although machine learning approaches have 
achieved success in many areas of Natural Lan- 
guage Processing, researchers have only recently 
begun to investigate applying machine learn- 
ing methods to discourse-level problems (Re- 
ithinger and Klesen, 1997; Di Eugenio et al., 
1997; Wiebe et al., 1997; Andernach, 1996; Lit- 
man, 1994). An important task in discourse 
understanding is to interpret an utterance's di- 
alogue act, which is a concise abstraction of a 
speaker's intention, such as SUGGEST and AC- 
CEPT. Recognizing dialogue acts is critical for 
discourse-level understanding and can also be 
useful for other applications, such as resolving 
ambiguity in speech recognition. However, computing
dialogue acts is a challenging task, because
often a dialogue act cannot be directly
inferred from a literal interpretation of an ut- 
terance. 

We have investigated applying Transforma- 
tion-Based Learning (TBL) to the task of com- 
puting dialogue acts. This method, which has 
not been used previously in discourse, has a 
number of attractive characteristics for our task. 
However, it also has some limitations, which we 
address with a Monte Carlo strategy that sig- 
nificantly improves the training time efficiency 
without compromising accuracy and a commit- 
tee method that enables TBL to compute con- 
fidence measures for the dialogue acts assigned 
to utterances. 

Our machine learning algorithm makes use 
of abstract features extracted from utterances. 
In addition, we utilize an entropy-minimization 
approach to automatically identify dialogue act 
cues, which are words and short phrases that 
serve as signals for dialogue acts. Our experi- 
ments demonstrate that dialogue act cues tend 
to be more effective than cue phrases and word 
n-grams, and this strategy can be further im- 
proved by adding a filtering mechanism and a 
semantic-clustering method. Although we still 
plan to implement more modifications, our sys- 
tem has already achieved success rates compa- 
rable to the best reported results for computing 
dialogue acts. 

Transformation-Based Learning 

To compute dialogue acts, we are using a mod- 
ified version of Brill's (1995a) Transformation- 
Based Learning method. Given a tagged train- 
ing corpus, TBL develops a learned model that 
consists of a sequence of rules. For example, in 
one experiment, our system produced 213 rules; 
the first five rules are presented in Figure 1. To 
label a new corpus of dialogues with dialogue
acts, the rules are applied, in turn, to every utterance
in the corpus, and each utterance that
satisfies the conditions of a rule is relabeled with 
that rule's new tag. For example, the first rule 
in Figure 1 labels every utterance with the tag 
SUGGEST. Then, after the second, third, and 
fourth rules are applied, the fifth rule changes 
an utterance's tag to REJECT if it includes the 
word "no", and the preceding utterance is cur- 
rently tagged SUGGEST. Note that an utter- 
ance's tag may change several times as the dif- 
ferent rules in the sequence are applied. 
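
The labeling procedure described above can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: the `Utterance` and `Rule` types and the `apply_rules` function are hypothetical names, only rules 1 and 5 of Figure 1 are encoded, and the sketch assumes that a rule's conditions are tested on every utterance before any tag is changed in that pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Utterance:
    words: List[str]
    tag: Optional[str] = None  # current dialogue act label


# A condition sees the whole dialogue and the index of the utterance under test.
Condition = Callable[[List[Utterance], int], bool]


@dataclass
class Rule:
    condition: Condition
    new_tag: str


def apply_rules(dialogue: List[Utterance], rules: List[Rule]) -> None:
    """Apply each rule, in order, to every utterance in the dialogue."""
    for rule in rules:
        # Assumption: conditions are evaluated against the tags as they stood
        # before this rule fired, and only then are the matches relabeled.
        matches = [i for i in range(len(dialogue)) if rule.condition(dialogue, i)]
        for i in matches:
            dialogue[i].tag = rule.new_tag


rules = [
    # Rule 1 of Figure 1: no conditions, so every utterance becomes SUGGEST.
    Rule(lambda d, i: True, "SUGGEST"),
    # Rule 5 of Figure 1: includes "no" and the preceding tag is SUGGEST.
    Rule(lambda d, i: "no" in d[i].words and i > 0 and d[i - 1].tag == "SUGGEST",
         "REJECT"),
]

dialogue = [
    Utterance("how about monday".split()),
    Utterance("no i am busy".split()),
]
apply_rules(dialogue, rules)
print([u.tag for u in dialogue])  # ['SUGGEST', 'REJECT']
```

The second utterance is tagged twice, first SUGGEST and then REJECT, which illustrates how a tag can change as later rules in the sequence fire.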



#   Condition(s)               New Tag
1   none                       SUGGEST
2   Includes "see" & "you"     BYE
3   Includes "sounds"          ACCEPT
4   Length < 4 words           GREET
    Prev. tag is none^
5   Includes "no"              REJECT
    Prev. tag is SUGGEST




Figure 1: Rules produced by the system 



To develop a sequence of rules from a tagged
training corpus, TBL attempts to produce rules 
that will correctly label many of the utterances 
in the training data. The system first gener- 
ates all of the potential rules that would make 
at least one label in the training corpus correct. 
For each potential rule, its improvement score is 
defined to be the number of correct tags in the 
training corpus after the rule is applied minus 
the number of correct tags in the training cor- 
pus before the rule is applied. The potential rule 
with the highest improvement score is applied 
to the entire training corpus and output as the 
next rule in the learned model. This process re- 
peats (using the new tags assigned to utterances 
in the training corpus), producing one rule for 
each pass through the training data, until no 
rule can be found with an improvement score 
that surpasses some predefined threshold, θ.
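
As a rough illustration of this greedy loop, the following Python sketch scores each candidate rule by its improvement score and stops once no score exceeds the threshold. It is a simplification under stated assumptions: the candidate rules are passed in as a fixed list rather than generated from the training data, and the names `train_tbl`, `apply_rule`, and `n_correct` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Tags = List[Optional[str]]
# A candidate rule: (name, condition(utterances, tags, index), new tag).
Rule = Tuple[str, Callable[[Sequence[List[str]], Tags, int], bool], str]


def apply_rule(rule: Rule, utterances: Sequence[List[str]], tags: Tags) -> Tags:
    """Return the tags that result from applying one rule to the whole corpus."""
    _, condition, new_tag = rule
    return [new_tag if condition(utterances, tags, i) else t
            for i, t in enumerate(tags)]


def n_correct(tags: Tags, gold: Sequence[str]) -> int:
    return sum(t == g for t, g in zip(tags, gold))


def train_tbl(utterances, gold, candidates: List[Rule], threshold: int = 0):
    """Greedily pick the rule with the highest improvement score on each pass."""
    tags: Tags = [None] * len(gold)
    model: List[str] = []
    while candidates:
        before = n_correct(tags, gold)
        # Improvement score: correct tags after the rule minus correct tags before.
        scored = [(n_correct(apply_rule(r, utterances, tags), gold) - before, r)
                  for r in candidates]
        best_score, best_rule = max(scored, key=lambda pair: pair[0])
        if best_score <= threshold:  # no rule surpasses the threshold
            break
        tags = apply_rule(best_rule, utterances, tags)
        model.append(best_rule[0])
    return model, tags


utterances = [["how", "about", "monday"], ["no", "i", "am", "busy"], ["see", "you"]]
gold = ["SUGGEST", "REJECT", "BYE"]
candidates = [
    ("all -> SUGGEST", lambda u, t, i: True, "SUGGEST"),
    ("see & you -> BYE", lambda u, t, i: "see" in u[i] and "you" in u[i], "BYE"),
    ("no & prev SUGGEST -> REJECT",
     lambda u, t, i: "no" in u[i] and i > 0 and t[i - 1] == "SUGGEST", "REJECT"),
]
model, tags = train_tbl(utterances, gold, candidates)
print(model)
print(tags)
```

On this toy corpus the loop emits three rules, one per pass, and the fourth pass finds no rule with a positive improvement score, so training stops.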

Since there are potentially an infinite number 
of rules that could produce the dialogue acts 
in the training data, it is necessary to restrict 
the range of patterns that the system can 
consider by providing a set of rule templates. 
The system replaces variables in the templates 
with appropriate values to generate rules. 
For example, the following template can be
instantiated with w="no", X=SUGGEST,
and Y=REJECT to produce the last rule in
Figure 1.

IF utterance u contains the word w
AND the tag on the utterance preceding u is X
THEN change u's tag to Y

^This condition is true only for the first utterance of
a dialogue.
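
To make the idea of instantiating a template concrete, here is a small Python sketch that fills the three variables of the template above from a vocabulary and a tag set. The class name `WordAndPrevTagRule` and the function `instantiate` are hypothetical, and enumerating every combination is an assumption made for illustration only.

```python
from itertools import product
from typing import List, NamedTuple, Optional, Sequence


class WordAndPrevTagRule(NamedTuple):
    """One instantiation of the template shown above."""
    w: str  # word that must occur in utterance u
    x: str  # tag required on the utterance preceding u
    y: str  # tag assigned to u when the rule fires

    def fires(self, words: Sequence[str], prev_tag: Optional[str]) -> bool:
        return self.w in words and prev_tag == self.x


def instantiate(vocabulary: Sequence[str],
                tags: Sequence[str]) -> List[WordAndPrevTagRule]:
    """Replace the template variables w, X, and Y with every combination of values."""
    return [WordAndPrevTagRule(w, x, y)
            for w, x, y in product(vocabulary, tags, tags)]


rules = instantiate(["no", "see"], ["SUGGEST", "REJECT"])
last_rule = WordAndPrevTagRule("no", "SUGGEST", "REJECT")
print(len(rules), last_rule in rules)                 # 8 True
print(last_rule.fires(["no", "thanks"], "SUGGEST"))   # True
```

Even this single template yields |V| x |T| x |T| rules for a vocabulary V and tag set T, which shows how quickly the space of potential rules grows.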

We have observed that TBL has a number 
of attractive characteristics for the task of com- 
puting dialogue acts. TBL has been effective on 
a similar^ task, Part-of-Speech Tagging (Brill,
1995a). Also, TBL's rules are relatively intu- 
itive, so a human can analyze the rules to deter- 
mine what the system has learned and perhaps 
develop a theory. TBL is very good at discard- 
ing irrelevant rules, because the effect of irrel- 
evant rules on a training corpus is essentially 
random, resulting in low improvement scores. 
In addition, our implementation can accommo- 
date a wide variety of different types of features, 
including set-valued features, features that consider
the context of surrounding utterances, and
features that can take distant context into ac- 
count. These and other attractive characteris- 
tics of TBL are discussed further in Samuel et 
al. (1998b). 

Dialogue Act Tagging 

To address a significant concern in machine
learning, called the sparse data problem, we
must select an appropriate set of features. Re- 
searchers in discourse, such as Grosz and Sidner 
(1986), Lambert (1993), Hirschberg and Litman
(1993), Chen (1995), Andernach (1996), Samuel
(1996), and Chu-Carroll (1998) have suggested
several features that might be relevant for the 
task of computing dialogue acts. Our system 
can consider the following features of an utterance:
1) the cue phrases^ in the utterance;
2) the word n-grams^ in the utterance; 3) the
dialogue act cues^ in the utterance; 4) the entire
utterance for one-, two-, or three-word utterances;
5) speaker information^ for the utterance;
6) the punctuation marks found in the
utterance; 7) the number of words in the utterance;
8) the dialogue acts on the preceding
utterances; and 9) the dialogue acts on the following^
utterances. Other features that we still
plan to implement include: 10) surface speech
acts, to represent the syntactic structure of the
utterance in an abstract format; 11) the focusing
information, specifying which preceding utterance
should be considered the most salient
when interpreting the current utterance; 12) the
type of the subject of the utterance; and 13) the
type of the main verb of the utterance.

Like other researchers, we recognize that
the specific word substrings (words and short
phrases) in an utterance can provide important
clues for discourse processing, so we should
utilize a feature that captures this information.
Hirschberg and Litman (1993) and Knott
(1996) have identified sets of cue phrases.
Unfortunately, we have found that these
manually-selected sets of cue phrases are insufficient for
our task, as they were motivated by different
domains and tasks, and these sets may be
incomplete.

Reithinger and Klesen (1997) utilized word
n-grams, which are all of the word substrings
(with a reasonable bound on the length) in the
training corpus. However, although TBL is capable
of discarding irrelevant rules, if it is bombarded
by an overwhelming number of irrelevant
rules, performance may begin to suffer.
This is because the improvement scores of irrelevant
rules are random, so if the system generates
too many of these rules, some of their
scores might, by chance, be high enough for selection
in the final model, where they can affect
performance on new data.

^The part-of-speech tag of a word is dependent on the
word's internal features and on the surrounding words;
similarly, the dialogue act of an utterance is dependent
on the utterance's internal features and on the surrounding
utterances.

^This feature is defined later in this section.

^In our system, we are handling speaker information
differently from the previous research. For example,
Reithinger and Klesen (1997) combine the speaker direction
with the dialogue act to make act-speaker pairs, such as
<SUGGEST,A→B> and <REJECT,B→A>. But we
believe it is more effective to use the change of speaker
feature, which is defined to be false if the speaker of the
current utterance is the same as the speaker of the
immediately preceding utterance, and true otherwise.

^If the system is participating in the dialogue, rather
than simply listening, the future context may not always
be available. But for an utterance that is in the middle
of a speaker's turn, it is reasonable to consider the
subsequent utterances within that same turn. And also, when
utterances from the later turns do become available, it
may be important to use this information to re-evaluate
any dialogue acts that were computed and determine if
the system might have misunderstood.

As a happy medium between the two extremes
of using a small set of hand-picked cue
phrases and considering the complete set of 
word n-grams, we are automating the analy- 
sis of the training corpus to determine which 
word substrings are relevant. We introduce a 
new feature called dialogue act cues: word sub- 
strings that appear frequently in dialogue and 
provide useful clues to help determine the ap- 
propriate dialogue acts. To collect dialogue act 
cues automatically from a training corpus, our 
strategy is to select word substrings of one, two, 
or three words to minimize the entropy of the 
distribution of dialogue acts given a substring. 
A substring is selected if the dialogue acts co- 
occurring with it have a sufficiently low entropy, 
discarding sparse data. Specifically, 

$$C \stackrel{\mathrm{def}}{=} \{\, s \in S \mid H(D \mid s) < \theta_1 \;\wedge\; \#(s) > \theta_2 \,\}$$

where C is the set of dialogue act cues, S
is the set of word substrings, D is the set
of dialogue acts, θ1 and θ2 are predefined
thresholds, #(x) is the number of times an
event, x, occurs in the training corpus, and
entropy^ is defined in the standard way:^

$$H(D \mid s) \stackrel{\mathrm{def}}{=} -\sum_{d \in D} P(d \mid s) \log_2 P(d \mid s)$$
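
The selection criterion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `substrings` and `select_cues` are hypothetical names, each substring is counted once per utterance that contains it, and P(d|s) is estimated by relative frequency.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def substrings(words: List[str], max_len: int = 3) -> Set[str]:
    """All distinct word substrings of one to max_len words in an utterance."""
    return {" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}


def select_cues(corpus: List[Tuple[List[str], str]],
                theta1: float, theta2: int) -> Dict[str, float]:
    """Keep substring s if H(D|s) < theta1 and #(s) > theta2; map s to H(D|s)."""
    acts_by_substring: Dict[str, Counter] = defaultdict(Counter)
    for words, act in corpus:
        for s in substrings(words):
            acts_by_substring[s][act] += 1

    cues = {}
    for s, acts in acts_by_substring.items():
        total = sum(acts.values())  # #(s)
        # Assumption: P(d|s) is the relative frequency of act d among
        # the utterances that contain s.
        entropy = -sum((c / total) * math.log2(c / total) for c in acts.values())
        if entropy < theta1 and total > theta2:
            cues[s] = entropy
    return cues


corpus = [
    ("see you then".split(), "BYE"),
    ("see you tomorrow".split(), "BYE"),
    ("see you".split(), "BYE"),
    ("the meeting".split(), "SUGGEST"),
    ("the answer is no".split(), "REJECT"),
    ("in the morning".split(), "SUGGEST"),
]
cues = select_cues(corpus, theta1=0.5, theta2=2)
print(sorted(cues))  # ['see', 'see you', 'you']
```

In this toy corpus "the" occurs three times but with two different dialogue acts, so its entropy (about 0.92 bits) exceeds the first threshold and it is rejected, while the rarer substrings fail the frequency threshold.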

The desirable dialogue act cues produced by 
our experiments can be organized into three cat- 
egories. Traditional cues are those cue phrases 
that have previously been reported in the lit- 
erature, such as "but" and "so"; potential cues 
consist of other useful word substrings that have 
not been considered, such as "thanks" and "see 
you"; and for dialogues from a particular do- 
main, there may be domain cues — for example, 
the appointment-scheduling corpora have dia- 
logue act cues, such as "what time" and "busy". 
Dialogue act cues in the first two categories 
can be utilized for learning general rules that 
should apply across domains, while the third 
category constitutes information that can fine- 
tune a model for a particular domain. 

^The entropy is capturing the distribution of dialogue
acts for utterances with a given word substring. By minimizing
entropy, we are selecting a word substring if it
produces a highly skewed distribution of the dialogue
acts, and thus, if this word substring is found in an utterance,
it is relatively easy to determine the proper dialogue
act.

^In practice, we estimate the probabilities with:

Category            #            Examples
Traditional cues    [illegible]  "and", "because", "but", "so", "then"
Potential cues      71           "bye", "how 'bout", "see you", "sounds", "thanks"
Domain cues         42           "busy", "meet", "o'clock", "tomorrow", "what time"
Superstring cues    690          "and then", "but the", "how 'bout the", "okay I", "so we"
 ...with filtering  472          "and then", "but the", "no I", "okay with", "so we"
Undesirable cues    170          "a", "be", "had", "in the", "to"

Figure 2: A set of dialogue act cues divided into five categories

But this method is not sufficiently restrictive;
it selects many word substrings that do not signal
dialogue acts. In many cases, an undesirable
dialogue act cue contains a useful dialogue act 
cue as a substring, so it should be relatively easy 
to eliminate. Examples of these superstring cues 
include "but the" and "okay I". We have im- 
plemented a straightforward filtering function 
to address this problem. If a dialogue act cue, 
such as "how 'bout the" is subsumed by a more 
general dialogue act cue with a better entropy 
score, such as "how 'bout", then the first di- 
alogue act cue only offers redundant informa- 
tion, and so it should be removed from the set 
of dialogue act cues to minimize the number of 
irrelevant rules that are generated. Our filter 
deletes a dialogue act cue if one of its substrings 
happens to be another dialogue act cue with a 
better or equivalent entropy score. 
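
A minimal Python sketch of such a filter is given below. The function names are hypothetical, substring containment is taken at the word level (an assumption), and the entropy values in the example are invented purely for illustration.

```python
from typing import Dict


def contains(superstring: str, substring: str) -> bool:
    """True if `substring` occurs as a contiguous, strictly shorter word sequence."""
    a, b = superstring.split(), substring.split()
    return len(b) < len(a) and any(a[i:i + len(b)] == b
                                   for i in range(len(a) - len(b) + 1))


def filter_cues(cues: Dict[str, float]) -> Dict[str, float]:
    """Delete a cue if one of its substrings is a cue with a better or equal entropy."""
    return {s: h for s, h in cues.items()
            if not any(contains(s, t) and cues[t] <= h for t in cues)}


# Illustrative entropy scores only; lower is better.
cues = {
    "how 'bout": 0.2,
    "how 'bout the": 0.3,  # subsumed by "how 'bout", which scores better
    "okay": 0.9,
    "okay with": 0.4,      # kept: its substring "okay" scores worse
}
print(sorted(filter_cues(cues)))  # ["how 'bout", 'okay', 'okay with']
```

Because a superstring survives whenever it has a strictly better entropy than all of its substrings, a filter of this kind removes only some of the superstring cues.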

Another effective heuristic is to cluster cer- 
tain dialogue act cues into semantic classes, 
which can collapse several potential rules into 
a single rule with significantly more data sup- 
porting it. For example, in the appointment- 
scheduling corpora, there is a strong correla- 
tion between weekdays and the SUGGEST di- 
alogue act, but to express this fact, it is nec- 
essary to generate five separate rules. How- 
ever, if the five weekdays are combined un- 
der one label, "$weekday$", then the same in- 
formation can be captured by a single rule 
that has five times as much data supporting 
it: "$weekday$" ⇒ SUGGEST. We have experimented
with clusters, such as "$weekday$",
"$month$", "$number$", "$ordinal-number$",
and "$proper-name$".
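
The clustering step amounts to replacing each member of a semantic class with the class label before word substrings are extracted. The Python sketch below shows the idea for two of the classes; the membership lists and the names `CLUSTERS` and `cluster_words` are assumptions made for illustration, not the hand-built classes used in the experiments.

```python
from typing import Dict, List, Set

# Hypothetical class membership; only the class labels come from the text.
CLUSTERS: Dict[str, Set[str]] = {
    "$weekday$": {"monday", "tuesday", "wednesday", "thursday", "friday"},
    "$month$": {"january", "february", "march", "april", "may", "june", "july",
                "august", "september", "october", "november", "december"},
}

# Invert the table once so that each word maps directly to its class label.
WORD_TO_LABEL = {word: label
                 for label, members in CLUSTERS.items()
                 for word in members}


def cluster_words(words: List[str]) -> List[str]:
    """Replace every word that belongs to a semantic class with the class label."""
    return [WORD_TO_LABEL.get(w.lower(), w) for w in words]


print(cluster_words("how about Tuesday".split()))
# ['how', 'about', '$weekday$']
print(cluster_words("friday in march".split()))
# ['$weekday$', 'in', '$month$']
```

After this substitution, utterances mentioning any of the five weekdays all contribute to the counts of the single substring "$weekday$".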

We collected a set of dialogue act cues, 
clustering words in six semantic classes, with 
θ1 = H(T) (the entropy of the dialogue acts)
and θ2 = 6. As shown in Figure 2, these dialogue
act cues were distributed among the four
categories described above, with an additional 
category for the remaining undesirable cues. 
Note that our simple filtering technique successfully
eliminated 218 of the superstring cues. We
plan to investigate more sophisticated filtering 
approaches to target the remaining 472 super- 
string cues. 

Limitations of TBL 

Although we have argued for the use of 
Transformation-Based Learning for dialogue act 
tagging, we have discovered a significant limita- 
tion of the algorithm: The rule templates used 
by TBL must be developed by a human, in ad- 
vance. Since the omission of any relevant tem- 
plates would handicap the system, it is essential 
that these choices be made carefully. But, in di- 
alogue act tagging, nobody knows exactly which 
features and feature interactions are relevant, so 
we would prefer to err on the side of caution by 
constructing an overly-general set of templates, 
allowing the system to learn which templates 
are effective. Unfortunately, in training, TBL 
must generate all of the potential rules for each 
utterance during each pass through the train- 
ing data, and our experimental results indicate 
that it is necessary to severely limit the number 
of potential rules that may be generated, or the 
memory and time costs are so exorbitant that 
the method becomes intractable. 

Our solution to this problem is to implement 
a Monte Carlo version of TBL to relax the re- 
striction that TBL must perform an exhaus- 
tive search. In a given pass through the train- 
ing data, for each utterance that is incorrectly 
tagged, only R of the possible template instan- 
tiations are randomly selected, where R is a pa- 
rameter that is set in advance. As long as R 
is large enough, there doesn't appear to be any 
significant degradation in performance. We be- 
lieve that this is because the best rules tend 
to be effective for many different utterances, so 
there are many opportunities to find these rules 
during training; the better a rule is, the more 
likely it is to be generated. So, although random
sampling will miss many rules, it is still
highly likely to find the best rules.
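
The sampling step can be sketched in Python as follows. This is an illustration under assumptions: rules are represented as plain strings, the names `sample_instantiations` and `monte_carlo_candidates` are hypothetical, and the sample is drawn uniformly without replacement, which the description above does not specify.

```python
import random
from typing import Dict, Iterable, List, Set


def sample_instantiations(rules: List[str], r: int, rng: random.Random) -> List[str]:
    """Return at most r of the rules that could be instantiated for one utterance."""
    if len(rules) <= r:
        return list(rules)
    return rng.sample(rules, r)


def monte_carlo_candidates(per_utterance: Dict[int, List[str]],
                           wrong: Iterable[int], r: int, seed: int = 0) -> Set[str]:
    """Pool the sampled rules over all incorrectly tagged utterances in one pass."""
    rng = random.Random(seed)
    pool: Set[str] = set()
    for i in wrong:
        pool.update(sample_instantiations(per_utterance[i], r, rng))
    return pool


# Rule "A" can be instantiated from all three utterances, the others from one.
per_utterance = {
    0: ["A", "B", "C", "D"],
    1: ["A", "E"],
    2: ["A", "F", "G"],
}
pool = monte_carlo_candidates(per_utterance, wrong=[0, 1, 2], r=2)
print(len(pool) <= 6, "A" in pool)  # True True
```

In this toy example "A" is sampled with certainty from utterance 1 and has further chances at utterances 0 and 2, whereas a rule such as "B" has a single chance; this mirrors the argument that rules which apply to many utterances are the most likely to be found.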

Experimental tests show that this extension 
enables the system to efficiently and effectively 
consider a large number of potential rules. This 
increases the applicability of the TBL method 
to tasks where the relevant features and feature 
interactions are not known in advance as well 
as tasks where there are many relevant features 
and feature interactions. In addition, it is no 
longer critical that the human developer iden- 
tify a minimal set of templates, and so this im- 
provement decreases the labor demands on the 
human developer. 

Unlike probabilistic machine learning ap- 
proaches, TBL fails to offer any measure of con- 
fidence in the tags that it produces. Confidence 
measures are useful in a wide variety of ways; 
for example, we foresee that our module for tag- 
ging dialogue acts can potentially be integrated 
into a larger system so that, when TBL cannot 
produce a tag with high confidence, other mod- 
ules may be invoked to provide more evidence. 
Unfortunately, due to the nature of the TBL 
method, straightforward approaches for track- 
ing the confidence of a rule during training have 
been unsuccessful. To address this problem, 
we are using the Committee-Based Sampling 
method (Dagan and Engelson, 1995) and the 
Boosting method (Freund and Schapire, 1996)
in a novel way: The system is trained multi- 
ple times, to produce a few different but rea- 
sonable models for the training data.^ To con- 
struct these models, we adopted the strategy 
introduced in the Boosting method, by biasing 
the later models to focus on those utterances (in 
the training set) that the earlier models tagged 
incorrectly. Then, given new data, each model 
independently tags the input, and the responses 
are compared. A given tag's confidence measure 
is based on how well the different models agree 
on that tag. Our preliminary results with five 
models show that this strategy produces use- 
ful confidence measures — for nearly half of the 
utterances, all five models agreed on the tag, 
and over 90% of those tags were correct. In 
addition, the overall accuracy of our system increased
significantly. More details on this work
are presented in Samuel et al. (1998b).

^With the efficiencies introduced by our use of features,
dialogue act cue selection, and the Monte Carlo
approach, we can implement modifications that require
multiple executions of the algorithm, which would be
infeasible otherwise.
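
One simple way to turn committee agreement into a number is sketched below in Python. The text says only that confidence is based on how well the models agree, so taking the majority tag as the output and the fraction of agreeing models as its confidence is an assumption, and `committee_tag` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def committee_tag(votes: List[str]) -> Tuple[str, float]:
    """Return the most common tag and the fraction of models that proposed it."""
    tag, count = Counter(votes).most_common(1)[0]
    return tag, count / len(votes)


# Tags proposed for one utterance by five independently trained models.
print(committee_tag(["SUGGEST", "SUGGEST", "SUGGEST", "SUGGEST", "ACCEPT"]))
# ('SUGGEST', 0.8)
print(committee_tag(["REJECT"] * 5))
# ('REJECT', 1.0)
```

Under this scheme, the case reported above in which all five models agree corresponds to a confidence of 1.0.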

Experimental Results 

A survey of the other research projects that 
have applied machine learning methods to the 
dialogue act tagging task is presented in Samuel 
et al. (1998a). The highest success rate was reported
by Reithinger and Klesen (1997), whose
system could correctly label 74.7% of the utter- 
ances in a test corpus. Their work utilized an 
N-Grams approach, in which an utterance's di- 
alogue act was based on substrings of words as 
well as the dialogue acts and speaker informa- 
tion from the preceding two utterances. Vari- 
ous probabilities were estimated from a training 
corpus by counting the frequencies of specific 
events, such as the number of times that each 
pair of consecutive words co-occurred with each 
dialogue act. 

As a direct comparison, we applied our sys- 
tem to Reithinger and Klesen's training set (143 
dialogues, 2701 utterances) and disjoint testing 
set (20 dialogues, 328 utterances), which consist 
of utterances labeled with 18 different dialogue 
acts. Using semantic clustering, θ = 1 (the improvement
score threshold), R = 14 (the Monte
Carlo sample size), a set of dialogue act cues, 
change of speaker, the dialogue act on the pre- 
ceding utterance, and other features, our sys- 
tem achieved an average accuracy score over 
five^ runs of 75.12% (σ=1.34%), including a
high score of 77.44%. We have also run di- 
rect comparisons between our system and Deci- 
sion Trees, determining that our system's per- 
formance is also comparable to this popular ma- 
chine learning method (Samuel et al., 1998b). 

Figure 3 presents a series of experiments 
which vary the set of word substrings utilized 
by the system.^ Each experiment was run ten
times, and the results were compared using a 
two-tailed t test to determine that all of the ac- 
curacy differences were significant at the 0.05 
level, except for the differences between rows 3 
& 4, rows 4 & 5, rows 4 & 6, rows 5 & 6, rows
5 & 7, and rows 6 & 7.

^This is to factor out the random aspect of the Monte
Carlo method.

^Note that these results cannot be compared with the
results presented above, since several parameter values
differ between the two sets of experiments.

^There are only 478 different cue phrases in the set,
but for our system, it was necessary to manipulate the 



Word Substrings                                      #            Accuracy
None                                                 0            41.16% (σ=0.00%)
Cue phrases (from previous literature)               [illegible]  61.74% (σ illegible)
Word n-grams                                         16271        69.21% (σ=0.94%)
Entropy minimization                                 1053         69.54% (σ=1.97%)
Entropy minimization with clustering                 1029         70.18% (σ=0.75%)
Entropy minimization with filtering                  826          70.70% (σ=1.31%)
Entropy minimization with filtering and clustering   811          71.22% (σ=1.25%)




Figure 3: Tagging accuracy on held-out data, using different sets of word substrings in training 



As the figure shows, when the system was re- 
stricted from using any word substrings, its ac- 
curacy on unseen data was only 41.16%. When 
given access to all of the cue phrases proposed 
in previous work,^ the accuracy rises significantly
(p < 0.001) to 61.74%. But this result is
significantly lower (p < 0.001) than the 69.21% 
accuracy produced by using all substrings of 
one, two, or three words (word n-grams) in the 
training data, as Reithinger and Klesen (1997) 
did. And the entropy-minimization approach 
with the filtering and clustering techniques produces
dialogue act cues that cause the accuracy
to rise significantly further (p = 0.003) to
71.22%.

Our experimental results show that the cue 
phrases identified in the literature do not cap- 
ture all of the word substrings that signal di- 
alogue acts. On the other hand, the complete 
set of word n-grams causes the performance of 
TBL to suffer. Our dialogue act cues generate 
the highest accuracy scores, using significantly 
fewer word substrings than the word n-grams 
approach. 

Discussion 

This paper has presented the first attempt 
to apply Transformation-Based Learning to 
discourse-level problems. We utilized various
features of utterances to learn effectively from a 
relatively small amount of data, and we have de- 
veloped an entropy-minimization approach with 
filtering and clustering that automatically col- 
lects useful dialogue act cues from tagged train- 
ing data. In addition, we have devised a Monte 

data in various ways, such as including a capitalized ver- 
sion of each cue phrase and splitting up contractions. 

^See Hirschberg and Litman (1993) and Knott (1996)
for these lists of cue phrases. We also included 45 cue
phrases that we pinpointed by manually analyzing a
completely different set of dialogues, two years before
we began working with the VerbMobil corpora.



Carlo strategy and a committee method to ad- 
dress some limitations of TBL. Although we 
have only begun implementing our ideas, our 
system has already matched Reithinger and 
Klesen's success rate in computing dialogue 
acts. 

In the future, we plan to implement more fea- 
tures, improve our method for collecting dia- 
logue act cues, and investigate how these mod- 
ifications improve our system's performance. 
Also, for the semantic-clustering technique, we 
selected the clusters of words by hand, but it 
would be interesting to see how a taxonomy, 
such as WordNet, could be used to automate this
process. 

When there is not enough tagged train- 
ing data available, we would like the system 
to learn from untagged data. Dagan and 
Engelson's (1995) Committee-Based Sampling 
method constructed multiple learned models 
from a small set of tagged data, and then, only 
when the models disagreed on a tag, a hu- 
man was consulted for the correct tag. Brill 
(1995b) developed an unsupervised version of 
TBL for Part-of-Speech Tagging, but this algo- 
rithm must be initialized with words that can 
be tagged unambiguously,^^ and in discourse, 
there are very few unambiguous examples. We 
intend to investigate a weakly-supervised ap- 
proach that utilizes the confidence measures de- 
scribed above. First, the system will be trained 
on a relatively small set of tagged data, pro- 
ducing a few different models. Then, given un- 
tagged data, it will use the models to derive 
dialogue acts with confidence measures. Those 
tags that receive high confidence can be used 
as unambiguous examples to drive the unsuper- 
vised version of TBL. 

While we contend that machine learning can 
be an effective tool for identifying dialogue acts, 

^^For example, "the" is always a Determiner. 



we do realize that machine learning may not be 
able to completely solve this problem, as it is 
unable to capture some relevant factors, such 
as common-sense world knowledge. We envision 
that our system may potentially be integrated 
into a larger system that uses confidence mea- 
sures to determine when world knowledge infor- 
mation is required. 